Kenneth Tay
Oct 9, 2018
Goal: Demonstrate that you know how to do data analysis in R
Minimum requirements:
vec <- c("a", "b", "c")
vec
## [1] "a" "b" "c"
vec[c(2,4)]
## [1] "b" NA
classes <- list(quarter = "Fall 2018/19",
                ID = c("STATS 32", "STATS 101", "STATS 200"),
                credits = 12)
classes$ID
## [1] "STATS 32" "STATS 101" "STATS 200"
classes[["credits"]]
## [1] 12
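The difference between `[`, `[[` and `$` on a list is easy to miss. A minimal sketch, reusing the `classes` list from above (the name `sub` is only for illustration):

```r
classes <- list(quarter = "Fall 2018/19",
                ID = c("STATS 32", "STATS 101", "STATS 200"),
                credits = 12)

# Single brackets return a (shorter) list
sub <- classes["credits"]
class(sub)                    # "list"

# Double brackets and $ return the element itself
class(classes[["credits"]])   # "numeric"
second_id <- classes$ID[2]    # second element of the ID vector
second_id                     # "STATS 101"
```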
A data frame is a special type of list:
data(mtcars)
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
str, summary
head, tail
names, dim, nrow, ncol
table
mean, median, sd, var
factor
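A short sketch of how these functions might be used together on `mtcars`. The variable names (`mpg_sd`, `mpg_var`, `cyl_f`) are only for illustration:

```r
data(mtcars)

# Size and shape of the data frame
dim(mtcars)             # rows, then columns
nrow(mtcars)
ncol(mtcars)
names(mtcars)           # column names

# First and last few rows
head(mtcars, 3)
tail(mtcars, 3)

# Counts of each value in a column
table(mtcars$cyl)

# Numerical summaries of one column
mean(mtcars$mpg)
median(mtcars$mpg)
mpg_sd  <- sd(mtcars$mpg)
mpg_var <- var(mtcars$mpg)   # variance is the square of the standard deviation

# Treat a numeric column as categorical
cyl_f <- factor(mtcars$cyl)
levels(cyl_f)
summary(cyl_f)          # for a factor, summary() gives counts per level
```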
I want all the rows such that the value of the cyl column is equal to 2:
vehicles[vehicles$cyl == 2, ]
df
## A B
## 1 1 a
## 2 2 b
## 3 3 c
## 4 NA d
## 5 NA <NA>
df$A == 2
## [1] FALSE TRUE FALSE NA NA
df[df$A == 2, ]
## A B
## 2 2 b
## NA NA <NA>
## NA.1 NA <NA>
Fix 1: test that the value is not NA and is equal to 2
df[!is.na(df$A) & df$A == 2, ]
## A B
## 2 2 b
Fix 2: use the which function
which(df$A == 2)
## [1] 2
df[which(df$A == 2), ]
## A B
## 2 2 b
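A third option, not on the original slides, is base R's `subset()`, which treats a missing condition as false. A minimal sketch, with `df` rebuilt here so the block runs on its own:

```r
# Same toy data frame as above (reconstructed for this example)
df <- data.frame(A = c(1, 2, 3, NA, NA),
                 B = c("a", "b", "c", "d", NA))

# subset() keeps only the rows where the condition is TRUE,
# so rows where A is NA are dropped
res <- subset(df, A == 2)
res
```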
E.g. Take the mean of c(1,3,NA).
mean(c(1,3,NA))
## [1] NA
mean(c(1,3,NA), na.rm = TRUE)
## [1] 2
ggplot2 (and the + syntax)
“The simple graph has brought more information to the data analyst’s mind than any other device.” - John Tukey
## mpg weight cylinders
## 1 21.0 2.620 6
## 2 21.0 2.875 6
## 3 22.8 2.320 4
## 4 21.4 3.215 6
## 5 18.7 3.440 8
## 6 18.1 3.460 6
## 7 14.3 3.570 8
## 8 24.4 3.190 4
## 9 22.8 3.150 4
## 10 19.2 3.440 6
## 11 17.8 3.440 6
## 12 16.4 4.070 8
## 13 17.3 3.730 8
## 14 15.2 3.780 8
## 15 10.4 5.250 8
## 16 10.4 5.424 8
## 17 14.7 5.345 8
## 18 32.4 2.200 4
## 19 30.4 1.615 4
## 20 33.9 1.835 4
## 21 21.5 2.465 4
## 22 15.5 3.520 8
## 23 15.2 3.435 8
## 24 13.3 3.840 8
## 25 19.2 3.845 8
## 26 27.3 1.935 4
## 27 26.0 2.140 4
## 28 30.4 1.513 4
## 29 15.8 3.170 8
## 30 19.7 2.770 6
## 31 15.0 3.570 8
## 32 21.4 2.780 4
What is the distribution of cylinders in my dataset?
What is the distribution of miles per gallon in my dataset?
What is the relationship between mpg and weight?
What is the relationship between mpg and time?
Not so good…
Easier to see the trend
For each value of cylinder, what is the distribution of mpg like?
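One possible way to draw plots for the questions above, as a sketch rather than the code behind the original figures. It assumes `df` was built from `mtcars` with `cylinders` stored as a factor. The time question is left out because `df` has no time column.

```r
library(ggplot2)

# Assumed reconstruction of the slides' data frame from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Distribution of a categorical variable: bar chart
p_bar <- ggplot(df, aes(x = cylinders)) + geom_bar()

# Distribution of a continuous variable: histogram
p_hist <- ggplot(df, aes(x = mpg)) + geom_histogram(bins = 10)

# Relationship between two continuous variables: scatterplot
p_scatter <- ggplot(df, aes(x = weight, y = mpg)) + geom_point()

# Distribution of a continuous variable within each category: boxplot
p_box <- ggplot(df, aes(x = cylinders, y = mpg)) + geom_boxplot()
```

Printing any of these objects (e.g. typing `p_box` at the console) displays the plot.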
I have father-son pairs. For each pair, I record their height and weight, as well as their ethnicities. I want to study the relationship between characteristics of the father and that of the son. What plots could help me?
ggplot2
ggplot2 package
ggplot2 reference manual
Data: Dataset we are using for the plot
## mpg weight cylinders
## 1 21.0 2.620 6
## 2 21.0 2.875 6
## 3 22.8 2.320 4
## 4 21.4 3.215 6
## 5 18.7 3.440 8
## 6 18.1 3.460 6
## 7 14.3 3.570 8
## 8 24.4 3.190 4
## 9 22.8 3.150 4
## 10 19.2 3.440 6
Geometries: Visual elements used for our data
Geom: point
Aesthetics: Defines the data columns which affect various aspects of the geom
3 different aesthetics:
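As a sketch of a point geom with three aesthetics, assuming the three are x position, y position and color (the list itself did not survive in this text) and that `df` is built from `mtcars` as below:

```r
library(ggplot2)

# Assumed reconstruction of the slides' data frame from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Each aesthetic maps one data column to one visual property of the points
p <- ggplot(data = df) +
  geom_point(mapping = aes(x = weight,         # horizontal position
                           y = mpg,            # vertical position
                           color = cylinders)) # point color
```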
We can have more than one layer in a graphic.
Each layer contains (essentially):
ggplot2 code: take 1
Making use of ggplot’s sensible defaults:
ggplot() +
  geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  geom_point(data = df, mapping = aes(x = cylinders, y = mpg))
ggplot2 code: take 2
Using jitter to avoid “overplotting”:
ggplot() +
  geom_boxplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  geom_point(data = df, mapping = aes(x = cylinders, y = mpg),
             position = "jitter")
ggplot2 code: take 3
When layers share attributes, we only have to type them once:
ggplot(data = df, mapping = aes(x = cylinders, y = mpg)) +
  geom_boxplot() +
  geom_point(position = "jitter")
Optional material
One graphic contains:
Behind the scenes, R may need to do some transformation on the dataset to make the graphic.
Sometimes we need to tweak the position of the geometric elements because they obscure each other.
Only 9 data points??
Much better
Default colors
Manually chosen colors
Default axis limits
Manually chosen axis limits
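A sketch of how colors and axis limits might be set by hand. The specific colors and limits are arbitrary choices for illustration, not the ones used in the original figures, and `df` is an assumed reconstruction from `mtcars`:

```r
library(ggplot2)

# Assumed reconstruction of the slides' data frame from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

# Default colors and axis limits
p_default <- ggplot(df, aes(x = weight, y = mpg, color = cylinders)) +
  geom_point()

# Manually chosen colors: one color per level of the factor
p_colors <- p_default +
  scale_color_manual(values = c("4" = "red", "6" = "darkgreen", "8" = "blue"))

# Manually chosen axis limits (wide enough to keep every point)
p_limits <- p_default +
  xlim(0, 6) +
  ylim(0, 40)
```

Note that `xlim()` and `ylim()` drop any points outside the limits before plotting, so limits narrower than the data would remove points.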
Refers to all non-data ink
ggplot2’s default theme
Minimal theme
Classic theme
Dark theme
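The four themes named above correspond to built-in ggplot2 theme functions. A minimal sketch, with `df` again an assumed reconstruction from `mtcars`:

```r
library(ggplot2)

# Assumed reconstruction of the slides' data frame from mtcars
df <- data.frame(mpg = mtcars$mpg,
                 weight = mtcars$wt,
                 cylinders = factor(mtcars$cyl))

base_plot <- ggplot(df, aes(x = weight, y = mpg)) + geom_point()

p_default <- base_plot + theme_gray()     # ggplot2's default theme
p_minimal <- base_plot + theme_minimal()  # minimal theme
p_classic <- base_plot + theme_classic()  # classic theme
p_dark    <- base_plot + theme_dark()     # dark theme
```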
rgb(0,0,1), rgb(1,0,0), rgb(0,0,0), rgb(1,1,1)
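For reference, base R's `rgb()` takes red, green and blue intensities (by default between 0 and 1) and returns a hex color string. A small sketch of the four calls above; the variable names are only for illustration:

```r
blue  <- rgb(0, 0, 1)   # "#0000FF"
red   <- rgb(1, 0, 0)   # "#FF0000"
black <- rgb(0, 0, 0)   # "#000000"
white <- rgb(1, 1, 1)   # "#FFFFFF"

# Intensities can also be given on a 0-255 scale
orange <- rgb(255, 128, 0, maxColorValue = 255)   # "#FF8000"
```

These strings can be used anywhere ggplot2 accepts a color, e.g. `geom_point(color = blue)`.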